Lexicon+TX: rapid construction of a multilingual lexicon with under-resourced languages

نویسندگان

  • Lian Tze Lim
  • Lay-Ki Soon
  • Tek Yong Lim
  • Enya Kong Tang
  • Bali Ranaivo-Malançon
چکیده

Most efforts at automatically creating multilingual lexicons require input lexical resources with rich content (e.g. semantic networks, domain codes, semantic categories) or large corpora. Such material is often unavailable and difficult to construct for under-resourced languages. In some cases, particularly for some ethnic languages, even unannotated corpora are still in the process of collection. We show how multilingual lexicons with under-resourced languages can be constructed using simple bilingual translation lists, which are more readily available. The prototype multilingual lexicon developed comprise six member languages: English, Malay, Chinese, French, Thai and Iban, the last of which is an under-resourced language in Borneo. Quick evaluations showed that 91.2 % of 500 random multilingual entries in the generated lexicon require minimal or no human correction.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Low Cost Construction of a Multilingual Lexicon from Bilingual Lists

Manually constructing multilingual translation lexicons can be very costly, both in terms of time and human effort. Although there have been many efforts at (semi-)automatically merging bilingual machine readable dictionaries to produce a multilingual lexicon, most of these approaches place quite specific requirements on the input bilingual resources. Unfortunately, not all bilingual dictionari...

متن کامل

Pronunciation Lexicon Development for Under-Resourced Languages Using Automatically Derived Subword Units: A Case Study on Scottish Gaelic

Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. To avoid the need for the linguistic knowledge, acoustic information can be used to automatically obtain the subword units and the associated pronunciations. Towards that, the present paper investigates the potential of a rec...

متن کامل

Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages

Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpora of a richer-resourced language pair, and inject it into a multilingual lexicon...

متن کامل

Towards weakly supervised acoustic subword unit discovery and lexicon development using hidden Markov models

State-of-the-art automatic speech recognition and text-to-speech systems are based on subword units, typically phonemes. This necessitates a lexicon that maps each word to a sequence of subword units. Development of a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be always readily available, particularly for under-resourced languages. In su...

متن کامل

MultiVal - towards a multilingual valence lexicon

MultiVal is a valence lexicon derived from lexicons of computational HPSG grammars for Norwegian, Spanish and Ga (ISO 639-3, gaa), with altogether about 22,000 verb entries and on average more than 200 valence types defined for each language. These lexical resources are mapped onto a common set of discriminants with a common array of values, and stored in a relational database linked to a web d...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Language Resources and Evaluation

دوره 48  شماره 

صفحات  -

تاریخ انتشار 2014